Skip to content

Conversation

kiddos
Copy link

@kiddos kiddos commented May 27, 2025

We sometimes need to store Unicode text in a fixed space (e.g., in a database column of type CHARACTER(32)). It's acceptable for the text to be truncated, but because we're dealing with Unicode, we can't simply treat the text as raw bytes and truncate it at 16 bytes — that might split a character in the middle. The function StringUtils.truncateToByteLength(String str, int maxBytes, Charset charset) helps handle this by safely truncating the string based on byte length while preserving valid character boundaries.

@ecki
Copy link

ecki commented May 28, 2025

Agree, very useful when dealing with UTF8 databases. Wonder if it should have a utf8 variant, where it does not have to re truncate, it can just look at the byte patterns at the border.

The current version does not deal with UTF16 code units properly. (Substring might cut them in half)

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello all,

I think you'll want tests that cover grapheme clusters to avoid problems like https://issues.apache.org/jira/browse/LANG-1770

@kiddos
Copy link
Author

kiddos commented May 28, 2025

I added some test cases for emoji characters 🚀✨🎉
I did some testing and found that current implementation the escape characters worked ("\uD83D\uDE80\u2728\uD83C\uDF89")
but "🚀✨🎉" doesn't

After adding <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> in pom.xml, "🚀✨🎉" seems to work.

@garydgregory
Copy link
Member

@kiddos
Please see my previous comment.

@kiddos
Copy link
Author

kiddos commented May 28, 2025

Oh, right.
it's just tricky to handle grapheme cluster.
the codePoint solution you mention does seems to work.
I'll add more tests using grapheme clusters.

@garydgregory
Copy link
Member

I'm not requesting support for grapheme cluster in the runtime, but we should set expectations in unit tests, whether they are supported or not. This is a larger discussion, which I raised in https://issues.apache.org/jira/browse/LANG-1770

@kiddos
Copy link
Author

kiddos commented Jun 29, 2025

for the test case, I only check if the final output is not null and the bytes size is actully smaller then specified byte size.
Is that ok?

}

@Test
void testTruncateToByteLength() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be separate test methods for different tests so they can pass or fail independently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can reuse the test class, but this is about 10 different tests that should each be a separate method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test suite in Commons IO is parameterized and takes into consideration the introduction of support for Unicode 15 in JDK 20 (grapheme clusters and so on): https://github.com/apache/commons-io/blob/b4ee32c53c0036429d64c0d6fe82a62a0fc6dae2/src/test/java/org/apache/commons/io/FileSystemTest.java#L171

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe habe one method testing the various null cases (more than at the moment) one for the usual single byte ascii cases one for bmp Tests and then one for Graphen cases and one for Dynamic. I would also avoid Emoji literals in source

@ppkarwasz
Copy link
Contributor

In apache/commons-io#781 I used a similar approach to introduce an equivalent functionality in Commons IO.

While approaches might differ, I think we can reuse the test cases.

}

@Test
void testTruncateToByteLength() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can reuse the test class, but this is about 10 different tests that should each be a separate method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants